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Abstract 

In cancer research, the comparison of gene expression or DNA methylation networks 
inferred from healthy controls and patients can lead to the discovery of biological path¬ 
ways associated to the disease. As a cancer progresses, its signalling and control networks 
are subject to some degree of localised re-wiring. Being able to detect disrupted inter¬ 
action patterns induced by the presence or progression of the disease can lead to the 
discovery of novel molecular diagnostic and prognostic signatures. Currently there is a 
lack of scalable statistical procedures for two-network comparisons aimed at detecting 
localised topological differences. We propose the dGHD algorithm, a methodology for 
detecting differential interaction patterns in two-network comparisons. The algorithm re¬ 
lies on a statistic, the Generalised Hamming Distance (GHD), for assessing the degree of 
topological difference between networks and evaluating its statistical significance. dGHD 
builds on a non-parametric permutation testing framework but achieves computation¬ 
ally efficiency through an asymptotic normal approximation. We show that the GHD is 
able to detect more subtle topological differences compared to a standard Hamming dis¬ 
tance between networks. This results in the dGHD algorithm achieving high performance 
in simulation studies as measured by sensitivity and specificity. An application to the 
problem of detecting differential DNA co-methylation subnetworks associated to ovarian 
cancer demonstrates the potential benefits of the proposed methodology for discovering 
network-derived biomarkers associated with a trait of interest. 


1 Introduction 

Gurrent efforts at understanding diseases rely on the ability to identify differences between 
healthy and affected tissues. A number of high-throughput platforms are now commonly 
used to compare genome-wide molecular profiles collected from large cohorts of healthy 
and diseased subjects in search for patterns that differentiate between them. For instance, 
in cancer research, gene expression and DNA methylation profiles from diseased tissues 
are compared to those extracted from normal controls in order to identify groups of genes 
whose expression or methylation levels are significantly different, and consequently associ¬ 
ated to the trait of interest. From a statistical modelling standpoint, the primary interest 
of these studies lies in detecting statistically significant changes in average gene expres¬ 
sion or methylation values in a two-sample comparison. A number of standard statistical 
tests, which are generally applied in a univariate fashion, have been proposed for this task 
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and generate candidate sets of genes for further investigation [42]. Statistical methods 
have also been developed to assess whether these candidate genes are over-represented in 
pre-defined biological pathways or subnetworks within protein interaction networks |34j . 
These developments are based upon the principle that, in order to understand the roles 
of genes in complex diseases, genes need to be studied in the context of the regulatory 
systems they are involved in [33 [S3 [SS] • 

An alternative way of analysing genome-wide expression and methylation levels ob¬ 
served in a random sample consists of studying their interaction patterns, which are often 
represented in the form of networks mm- Network edges quantify the similarity in 
transcription activity between two genes [SH] or in DNA methylation between two CpG 
islands m, respectively. The notion of similarity is usually measured by linear corre¬ 
lation, partial correlation or mutual information coefficients estimated from the sample 
data [ini [To] . The networks arising in the two-sample setting above can then be compared 
to assess whether there are statistically signihcant differences in network topology that 
can be associated to the disease. The detection of markedly distinct interaction patterns 
across conditions may be indicative of local disturbances within known biological path¬ 
ways, and can be taken as candidate biomarkers. For instance, as a cancer progresses, it 
has been observed that its signalling and control networks are subjected to re-arrangments 
which are advantageous for the cancer [3]. Changes in methylation levels are believed to 
be among the earliest and most common alterations in human cancers |43L [S] , and topo¬ 
logical differences in healthy and diseased networks can reflect significant dysregulations 
associated to the disease m- 

In this paper we discuss the the problem of comparing two labelled biological networks, 
each one representing a different population or condition, with the aim of detecting statis¬ 
tically significant differences between them. We approach this problem from a hypothesis 
testing perspective. This is a challenging statistical problem as only one random network 
is observed under each condition. Various computational methodologies have been de¬ 
veloped to compare networks, including graph matching and graph similarity algorithms 
[8] . Graph matching algorithms have been used to discover similarities between molecular 
pathways across organisms and functions [HllS], but are typically limited to unlabelled 
graphs, and are not concerned with hypothesis testing. Graph similarity algorithms also 
assume that the graphs are unlabelled, and the attention has mostly focused on detect¬ 
ing patterns that are most similar between networks m- For instance, gene modules 
can be identified separately in each network first, and then compared across networks 
[13121110]. More closely related work includes inferential methods for performing two- 
sample hypothesis tests where the sampling unit is a network, and assess whether the two 
paired networks come from the same assumed model |48| . 

We take a non-parametric approach to inference that does not require to make assump¬ 
tions about a specific random network model. Our premise is that any true topological 
differences between the two networks would involve only a smaller set of edges, compared 
to all edges in the network, which we aim to detect. Our contributions to this problem are 
as follows. First, we consider the issue of choosing a distance measure between two paired 
networks that is able to capture subtle topological differences. Second, we discuss how 
to establish whether large values of this distance can be deemed statistically significant 
under a null hypothesis that the networks are independent. Finally, we ask whether it 
is possible to identify a differential subnetwork, starting from two large networks, in a 
computationally efficient manner. 

The article is organised as follows. In Section [^ we introduce a distance for la¬ 
belled networks, the Generalised Hamming Distance (GHD). Building on this distance, a 
permutation-based test statistic for two-sample network comparisons is introduced in Sec¬ 
tion [^ Conditions for asymptotic normality are provided so that p-values can be obtained 
in closed-form without the need to carry out computationally expensive permutations. In 
order to verify these results in special cases, in Section [^ we argue that the proposed 
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conditions hold true for various random network models, and provide a sketch proof for 
the case of scale free networks. In Section we describe an algorithm, dGHD, for the 
detection of differential subnetworks. In Section we present a number of simulation 
experiments that highlight the advantages of the proposed methodology under different 
graph models. As an illustrative application of the proposed methodology, a case-control 
study involving DNA co-methylation networks in ovarian cancer is presented in Section 
0 We conclude with a discussion in Section [HI 

2 The generalized Hamming distance 

We assume to have observed two paired biological networks, each represented by a graph, 
denoted hy A = {V,Ea) and B = {V,Eb), respectively. Both graphs are defined on a 
common set, V = {1, 2,..., A^}. The respective sets Ea and Eb of edges indicate the 
connection between the nodes in the two graphs. We also let the matrices A = (A^ ) and 
B — {Bij) denote the two {N x N) adjacency matrices associated with graphs A and B, 
respectively. 

The Hamming distance (HD) between A and B provides a commonly used metric to 
quantify the difference between the networks, and is defined by ^tr[(A — H)^], where tr[-] 
denotes the trace of a matrix. This distance takes into account the number of edges that 
are in common between the two networks. Here we propose an extension of this metric, 
which we call the Generalised Hamming Distance (GHD), dehned as 

= N(N-l) 

where and are mean-centred edge weights defined as 

<3 = - 7V(7V-1) " N{N-l) ^ 

ij ^ ^ hj 

and j denotes summation over distinct i and j. The edge weights, which depend on the 
topology of the networks, provide a measure of connectivity between every pair of nodes i 
and j in A and B, respectively. When atj and bij are binary values indicating the presence 
or absence of an edge, i.e. are the elements of A and B, GHD(A, H) is related to the HD. 
The specific node weights we propose here instead quantify the topological overlap (TO) 
between a pair of nodes by taking into account the local neighbourhood structure around 
those nodes [21]. In the literature, the TO measure has been successfully applied for the 
detection of communities in biological networks, and there is empirical evidence that it 
carries biological meaning [49111]. 

We use the one-step TO between nodes i and j indicating whether they share direct 
connections to other nodes. The weights are obtained from the adjacency matrix as 
follows: 

+ Ay 

“ niin(fi;^^ Au - Aij , Aa - Ay) + 1 ’ ^ 

when i j, and otherwise Oy- = 1, and analogously for 6y. The GHD sums the 
squared differences (aO — bA)^ over all pairs of nodes in the network. Note that the 
term j is a count of all vertexes (i,l,j) containing node pair (z,j). This 

term measures the connectivity information of each (f, j) pair plus their common one-step 
neighbours. The denominator in (§ can be written as min(dy dj) -I-1 — Ay, where di and 
dj represent the node degrees of i and j, respectively. It is roughly equal to the smaller 
of (di,dj) and normalises aij such that 0 < Oy < I. A large discrepancy between a'ij and 
b'ij indicates a topological difference localised around that pair of nodes. 
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(a) reference network 


(b) HD=0.0444 GHD=0.0123 



(c) HD=0.0444 GHD=0.0181 (d) HD=0.0444 GHD=0.0395 

Figure 1: Networks (b), (c) and (d) are generated from the reference network (a) by a single 
edge change. Both HD and GHD between the reference network and each modihed paired 
network have been computed in each case. 


By exploring the neighbourhood of each node, the proposed GHD can detect subtle 
topological changes with higher sensitivity compared to the HD. A simple illustration of 
this is given in Figure where four simple networks are shown: the network labelled 
(a) is taken as reference while the three paired networks (b), (c), and (d) have been 
generated by changing the position of a single edge in (a). The two distances, HD and 
GHD, have been computed to quantify the difference between (a) and each of the other 
three networks. It can be observed that, whereas the HD is unable to distinguish between 
the three networks, the GHD score is more sensitive to subtle topological variations and 
can discriminate between them. 


3 A non-parametric test for network comparison 

For inferential purposes, we require computing the probability that a distance as extreme 
or more extreme than the observed GHD value could have been observed by chance 
only. By treating the GHD as a random variable with unknown sampling distribution, 
this probability can be estimated non-parametrically via permutation testing. First, we 
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specify the null hypothesis as being 


Hq : networks A and B are independent. 


(3) 


By taking B as reference, each permutation consists of shufSing the labels of the nodes 
in A while keeping the edges unchanged. This generates a permuted network Ayr- that is 
isomorphic to A, and the exchangeability property holds. In turn, this signifies that the 
original and permuted networks are generated from the same underlying, but unspecified, 
model Emn]. Since all permutation networks are isomorphic, permuting the labels of the 
network is equivalent to shuffling rows and columns of the adjacency matrix, an approach 
that bears some similarity with Mantel’s test [33] for the comparison of two distance 
matrices. All the the N! possible permutations are then collected in a set H, and for each 
TT S n a permuted GHD value is denoted as 

and is calculated from the edge weights after permutation. The exact permutation 

distribution is obtained by carrying out an exhaustive calculation of all GHD,^ values, and 
p-values can then be evaluated as usual. In practice, however, doing so is computationally 
infeasible because the cardinality of 11 is generally extremely large, even for relatively 
small networks. The exhaustive evaluation for all permutations in 11 could be replaced 
by a Monte Garlo approach whereby only a smaller number of random permutations are 
explored. Nevertheless, the overall computational costs remain high for networks of the 
moderately large sizes observed in applications or when this procedure has to be repeated 
several times, for instance when searching for a differential subnetwork as in Section]^ 

In what follows, we propose an alternative approach that removes the need to carry 
out computationally expensive permutation testing altogether. We demonstrate that, 
under our null hypothesis, the exact GHD permutation distribution can be approximated 
well by a normal distribution with moments that can be obtained analytically, in closed 
form. First, we notice that the GHD can be rewritten in an equivalent form in terms of 
a generalised correlation coefficient as follows: 

GHD^(X,e) = c - ^ ^ (4) 

' ' id 

where c is a constant that does not change under permutations. By making use of this al¬ 
ternative representation, we are able to exploit well-known sufficient conditions for asymp¬ 
totic normality, which can also be easily checked in practice. For a generalised correlation 
coefficient of this form, the exact permutation distribution is asymptotically normal under 
two sufficient conditions [11121135]: 


H <3 = = 0 and 


y [J2ijki<3<Ai? 


lim 

N^OO 


ITyyjk Kjb'ik? 


= 0 . 


(5a) 

(5b) 


Gondition (5a) follows directly from the definition of aG and a s being mean-centred. 
In order to gain some insight into the meaning of condition (5b I in our context, it is 
instructive to consider the case where Oy and bij are elements of the two adjacency 


matrices, i.e. they indicate the presence of an edge, 
a = ;^ ^lave 

_ f \ ^ 


On defining Ui. = 


O'ij 


jA 


— (Xi. Oj. 


and 


( 6 ) 
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and condition (5b), with reference to network A, can be written as 

E.K-a)='’ 




'1213 


= 0 , 


( 7 ) 


and analogously for B. It can be observed that, when using the adjacency matrix, Ui. 
represents the degree of the node. An analogous condition also applies to B. Therefore, 
checking (5b I amounts to computing the degree of each node in the two networks, and 


assessing the limiting behaviour. When the TO measure is used instead, as in the GHD, 
the coefficient represents the overall topological overlap information at node i, and can 
also be computed using & 

When both (5a) and (5b) hold true, under the null hypothesis, the permutation dis¬ 


tribution of GHD(A, S) is approximately normal. We then standardise the GHD value 
by mean-centring and normalising it, so that it follows a standard normal distribution 
asymptotically. 


N(0,1) 


( 8 ) 


where fij,- and (T,r are the mean and standard deviation of GHD under the exact permu¬ 
tation distribution, respectively. These two moments can be computed precisely and in 
closed-form by enumerative combinatorics; the calculations follow developments described 
in the context of related permutation-based testing procedures [33] , and can also be found 
in [35]. Here we provide explicit formula for both and as follows. First, we need to 
define 


N N 


N N 


= ^ = and Ta = ^C^a,jf 

i—1 j — l i—1 

N N N N 

= ^ = 1’2 and n = 

i—1 j—1 i—1 j—1 


where a\j and feF are edge weights with power t. Here and are empirical 

raw moment of edge weight a^-, and analogously for bij. Furthermore we need to introduce 
the following quantities, 

Aa = CSaf, Ba=Ta-CSa), and Ca = Aa + 2CSa) - 

At = CSi,f, Bi, = Tt-CSb), and Ct = A^ + 2(^3^) - 

Then, closed-form expressions for the mean /i,r and variance are, 

^Sa+^Sb 2{^Sa)eSb) 

4 , I 4(i?a)(i?b) {Ca){Cb) 

N-2 {N-2){N-3) N{N - 1)' 

With the expressions for the first two exact moments, a corresponding p-value can there¬ 
fore be efficiently computed from the normal approximation, even for very large networks. 
We will exploit the computational efficiency gained here in Section]^ where we apply the 
test repeatedly on networks of increasingly smaller size in order to detect differential 
subnetworks. 


4 Validation of asymptotic normality on scale-free net¬ 
works 


The closed-form approximation for the computation of p-values only requires that con¬ 
ditions (5a) and (5b) are satisfied, and does not need any random network model to be 
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specified. These two conditions can also be verified analytically in special case when cer¬ 
tain random network models are assumed. For instance, in |39j it was proved that these 
conditions hold true for scale-free (SF), random geometric (RG) and Erdos-Renyi (ER) 
network models when using both HD and GHD distances. In this section we provide a 
simplified proof for the case of SF networks using the Hamming distance. This proof 
should serve as an illustration of how these derivations can be carried out analytically, 
and as simple validation of the methodology described in Section for SF networks. An 
analogous proof using the GHD distance can be found in the Supplementary Material, 
and we refer the reader to [53] for the other models. 

A SF network is a network whose node degree distribution follows a power law, at least 
asymptotically, and has often been used to describe real biological networks naiiaisii. 
The degree of each node is assumed to be an independent and identically distributed (HD) 
random variable with probability mass function defined as 

P{di = k) = ck~^, k = m,m-\- (9) 


where m and K are the lower and upper cut-offs for the node degree, respectively, c is a 
normalising constant, and a represents a power exponent. It is generally assumed that a 
is greater than 1, and the lower cut-off m is generally be taken to be I. The upper cut-off 
AT for a > 2 is conventionally specified as K = [T5], and generally K = N — 1 

for I < a < 2. Values of a for different biological networks have been characterised, and 
mostly vary between 1.4 to 1.7 [T5] . 

On defining the weights and bij as elements of A and B, respectively, Q becomes 


lin. = 0 


( 10 ) 


where d is the average node degree. In order to study this limiting behaviour, we exploit 
the fact that both numerator and denominator are powers of the centralised empirical mo¬ 
ments of the node degree distribution. We let /ig = cJ2d=i denote the theoretical 
moment and nis = X]i=i ^^e corresponding empirical moment of this distribution. In 
order to study the limit above we need to characterise the order of rus, for s = 1, 2,3, as N 
increases. Our strategy here consists of first characterising the order of asymptotically, 
for the first three moments, and establishing a correspondence with ms- 

We start by examining the order of fj-s-, for s = 1, 2,3, in the limit. Since this depends 
on s, we consider three distinct cases: (a) s — a -I- 1 < 0, (b) s — a -f 1 = 0 and (c) 
s — a -I- 1 > 0. For (a), the order of is J2d=i = 0(1). For (b), the order of 

d's is = 0(ln(Ar)). Finally, for (c), we need to study how increases with K. 

First, we apply the Euler-Maclaurin formula. 


K pK 

^ -h (a - s) / 

d=l 




s+1 


dx + 0(1), 


where [xj denotes the largest integer that is not greater than x. To compute the order of 
we need to know which one of the two terms in the sum dominates in order. 
By applying rHospital’s rule we have 

sJi ^ _s_ 

if^oo a:s-«+i s-d-hl’ 

which is a finite constant, and hence fig has the same order as For a SF network, 

the condition for asymptotic normality also depends on the values taken by the exponent. 
In the case where 1 < a < 2, for which K = N —1, the calculation of the moment falls 
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under case (c), hence we conclude that the order of the first three theoretical moments 
are, respectively, and 

We now turn to the direct comparison of the orders of /Xg and rUg in the limit. Specif¬ 
ically, we assess whether the order of each /ig established above also holds true for the 
corresponding rrig. This can be verified by checking that 


ms 

hm - = Cg 

N^OO l_lg 


( 11 ) 


for s = 1, 2, 3, and for some positive constants Cg. To study the above limit, we apply the 
Weak Law of Large Numbers (WLLN). For the WLLN to hold, ^g must be finite. Hence 
we first transform di so that fig, after the transformation, is finite. We let Ng = , 

and define Zgi = . The distribution of Zgi is 


P{Zsi = 

where c' = cNg. Thus the 

dzs = c' ^ 


j) = c'0- 


1 2 


iVg’fVg’-’ 
theoretical moment of z,, is 




fJ-S 


iV«+i 


K 


Ng 


OL ’ 


which is finite. Denoting by m^g the empirical moment of Zgi,i = 1, ...,fV, we have 


m^g 



i=l 



mg 

J^s+l-a ■ 


Now, since /i^g is finite and since di,d 2 , ■■■,dN are assumed IID, Zgi, Zs 2 , ...ZgN are also 
IID, and according to the WLLN, m^g converges to ^zs in probability. Hence we have 


1 = 


m. 


lim 

Af-foo fl^g 


lim 

N —¥00 


rris 

Ms 

AT^+i-c 


lim 

N^oo 


rUg 
ds ’ 


indicating that mg and fXg are of the same order asymptotically. Using this result, we 
are able to approximate the orders of the numerator and denominator of condition Q; 
J2iidi - d)^ = lV(m 3 - 2m2mi -I- 2mf) is and J^iidi - d)^ = N{m 2 - 

is Substituting into Q, we see that the numerator is of order 

the denominator is of order and therefore the ratio is of order 0{N°‘~^). 

Hence for 1 < a < 2, the limit in ( [I^ is 0. By following a similar procedure, it can be 
proved that the normality condition is also satisfied when a > 3. 


5 Differential subnetwork detection 

In this section we leverage the test statistic of Section to detect a differential subnet¬ 
work. When comparing the two networks, the expectation is that only a subset of edges 
would present altered interaction patterns. This task is formulated here as the prob¬ 
lem of detecting a subset V* G V for which there is no sufficient evidence to reject the 
null hypothesis that the corresponding subnetworks A*{V*,Ea>) and B*{V*, Eb*) are 
statistically independent. 

An algorithm for the detection of V* should take into account the fact that a certain 
degree of topological difference between A and B is always bound to be observed, even 
when the two population networks are the same, due to finite sample variability. The 
GHD test provides an efficient way to assess the statistical significance of any observed 
discrepancy between two paired networks, and is used as a building block to derive an 
algorithm that identifies differential subnetworks. 
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We indicate by Vk a subset of V of size K < N, and define the centralised GHD test 
statistic computed by comparing A = {Vk,Ea) and B = {Vk,Eb) by 

= GRBiAiVK,EA),BiVK,EB)) - ( 12 ) 

where ^Vk is the mean of the permutation distribution for node set Vk- Furthermore we 
define Ay^p to be the centralised GHD value computed by comparing the two networks 
after removal of node i. The quantity 

Si = Ay^|j - Ay^, 

measures the influence that node i has on the mean-centred GHD test when comparing 
two subnetworks defined on set Vk- We propose an iterative procedure which removes 
a node or set of nodes at each step, and generates a sequence of node sets of increasing 
smaller size, i.e. 

Vn a Vn-1 a ... a VAr„,„, 

where Aj^in < A is a constant indicating the smallest allowed size of subnetwork. Starting 
with Vk-, the two corresponding networks are compared by the GHD test, and a p-value 
is computed, as described previously. For each node indexed hy i = 1,...,A, the corre¬ 
sponding Si is computed, and the node associated with the largest positive Si value is 
removed. Given a new set Vk-i, the process is then repeated again, and then again until 
a specihed minimal set size is reached. 


Differential subnetwork: statistical significance 
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Figure 2: Sequence of adjusted p-values produced by the dGHD algorithm as a function of 
subnetwork size. The size of the subnetwork is progressively reduced by removing nodes that 
further increase the distance between the subnetworks. For this example we simulated 2D 
RG networks of size 1,000 and subnetworks of size 200. 
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This simple algorithm produces a monotonic sequence of p-values that increases as the 
subnetwork size decreases (e.g. see Figurej^. The p-values should be adjusted for multiple 
testing, e.g. by controlling the false discovery rate In the presence of a differential 
subnetwork, the sequence is expected to feature a peak corresponding to the size of the 
subnetwork. Specifically, for a given desired significance level a, the algorithm finds the 
largest K, with N > K > A^min, such that the adjusted p-value exceeds a. Clearly the 
algorithm benefits from the fact that p-values at each iteration can be computed very 
quickly in closed-form. 


6 Simulation experiments 


In this section we report on three different simulation experiments that have been carried 
out to study the properties of the proposed methodology. Our simulations make use of 
RG networks, which are plausible models for biological networks [351133 131 Hi]- Two- 
dimensional RG networks were generated by first uniformly sampling N points on [0,1]^, 
each one corresponding to a node in the graph. A pair of nodes was connected by an edge 
if the Euclidean distance between the corresponding two-dimensional points was smaller 
than a pre-determined threshold d. 

The purpose of the first simulation study was to confirm the asymptotic null sampling 
distribution of the GHD statistic. In this case we randomly generated 10,000 pairs of 
networks A and B of size N = 250, with parameters d = 0.3 and d = 0.15. For each d 
value, paired networks were independently generated, and the GHD test was computed 
to detect differences between them. As a result of this process, we obtained an empirical 
distribution of p-values. Under the null, this distribution is expected to be uniform on 
[0,1], and the resulting QQ plots confirm that the empirical moments of this distribution 
agree perfectly with the expected theoretical moments for a RG model; see Figure 

In the second study, we compared the ability of the GHD test to detect differential 
networks against three competing tests: Mean Absolute Difference (MAD) [9], Quadratic 
Assignment Procedure (QAP) ^3 ^md Gonditional Uniform Graph (GUG) [3]. The MAD 
test counts the number of different edges in the two networks 

MAD(M, B) = ^ ^ - h, I, (13) 


where and bij correspond to the (i,j) elements in the adjacency matrices of A and 
B, respectively. The QAP uses edge set product statistics to test for the independence 
between networks, 

QAP(M, H) = (14) 

where and bij are again elements of the adjacency matrices. For both the MAD 
and QAP tests we also used the traditional permutation testing approach. We further 
included in the study the GUG approach. According to this procedure, random networks 
are generated with pre-determined properties, such as size and density, matching the 
properties of the observed networks. For each simulated pair of random networks, a 
measure of correlation between networks is computed, and its empirical distribution is 
built up over many simulations. The correlation coefficient is defined as: 


gcor(M,H) = ^(oy - 


Sij ^ij s 

N{N -1)’^ 


where and bij are elements of the adjacency matrices for A and B, respectively [45] . 
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Figure 3: QQ plots confirming the asymptotic null distribution of the GHD test. For RG model, 10,000 
paired networks of size 250 were independently generated and the GHD test was applied to detect differences 
between them. An empirical distribution of p-values was obtained through 10, 000 comparison for each model, 
which under the null is expected to be uniformly distributed on [0,1]. The figure shows that empirical and 
theoretical moments agree. 


This experiment required the simulation of paired networks with a pre-specified degree 
of topological dissimilarity. This was achieved by generating A first, using one of the two 
random models as described above. Network B was then obtained by first making an 
exact copy of A, and then randomly shuffling a fixed proportion 7 of edges so that, as 
7 increases, the dissimilarity between A and B increases. For each given value of 7 , we 
generated 1,000 pairs of networks, computed the tests and corresponding p-values, and 
evaluated the proportion of tests that rejected the null hypothesis of independence at a 
5% significance level. The results of this study are summarised in Figure where the 
’’power” is defined as the the proportion of replications, out of 1 , 000 , when we accept the 
null hypothesis of independence. This rises from zero at 7 = 0.84, when networks are still 
associated, to close to 1 when a lot of shuffling has been carried out, to produce nearly 
independent networks. This figure shows that for noise levels as large as 7 = 0.93, the 
tests based on HD consider the two networks to be strongly associated. It is only when 
reaching that threshold that their power starts increasing rapidly away from zero. This 
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Power 


Power comparisons: RG network 



Figure 4; The ’’power” is defined as the proportion of replications when the null hypothesis of independence 
is not rejected. As the noise level 7 increases, the GHD test has more power to detect true structural changes 
compared to competing methods. Simulations are based on 2D RG networks. 


suggests that the tests based on HD may be too stringent for real application and miss 
importance differential patterns. By contract, the GHD test is able to detect differences 
at lower noise levels compared to other tests and capture more subtle differences. This is 
not surprising as GHD is more sensitive to topological changes, as seen in Figure 

In the third simulation study, we carried out an investigation to assess the behaviour 
of the differential subnetwork detection algorithm, and quantify its performance in com¬ 
parison with other tests. We report on experiments involving RG networks A and B of 
size 1,000 and generated as described above using a noise parameter 7 . Two independent 
subnetworks, denoted here by A* and B*, were introduced by randomly selecting a subset 
F* C F of size 200, and replacing the existing edges with connections simulated from 
two independent RG networks. For each value of 7 , we generated 100 such paired large 
networks containing smaller differential subnetworks. We term a true positive (TP) a 
node that is correctly identified as belonging to the differential subnetwork, and a false 
negative (FN) a node that belongs to the subnetwork but has not been detected by the 
algorithm. Similarly we define false positives (FP) and true negatives (TN). In Table 
we report the sensitivity or true positive rate (TPR) computed as TP/(TP-|-FN), and 
the specificity (SPG) computed as TN/(FP-|-TN). For comparative purposes, we have 
also implemented an alternative algorithm, called dHD, which is similar to dGHD but 
uses the Hamming distance instead for distance calculations. As can be observed, both 
dHD and dGHD maintain high sensitivity and specificity up to moderately high noise 
levels. For noise levels at the top end of the spectrum, dHD has slightly higher sensitivity 
but much smaller specificity than dGHD, indicating that it detects a larger number of 
incorrect nodes. 

Figure [^provides an example of simulated networks A and B and ground truth differ¬ 
ential subnetworks A* and B* as well as the differential subnetworks A* and B* detected 
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by dGHD in one of the 100 simulations. The corresponding sequence of p-values generated 
by running the dGHD algorithm in this example is shown in Figure It can be noticed 
how the null hypothesis of independence is rejected for all the subnetworks of size ranging 
from 1000 down to 200, at which point there is no evidence to reject the null, and the 
algorithm produced large p-values for all sizes smaller than 200. 



Figure 5: Example of differential subnetworks detected by dGHD using 2D RG networks. .4* and B* are 
the true simulated independent subnetworks, and A* and B* are the subnetworks detected by the algorithm 
(7 ~ 0.23). Nodes belongs to differential subnetwork are coloured in green, and edges colored red. Please refer 
to Table for full results. 


Table 1: Sensitivity (TPR) and specificity (SPC) of the subnetwork detection 
algorithms for different values of 7 , the noise level. The results are based on 
simulated RG networks. 



7 

0.055 

0.11 

0.23 

0.54 

0.79 

0.95 

dGHD 

TPR 

SPC 

0.897 

0.987 

0.889 

0.984 

0.855 

0.974 

0.627 

0.912 

0.570 

0.768 

0.789 

0.439 

dHD 

TPR 

SPC 

0.914 

0.978 

0.904 

0.971 

0.872 

0.956 

0.725 

0.843 

0.712 

0.567 

0.862 

0.201 
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7 Application to co-methylation networks in ovarian 
cancer 

We present an application to a case-control epigenetic study of ovarian cancer. The 
dataset for this study was originally presented in m- Methylation profiles for 27, 578 
CpGs islands were obtained from whole blood samples in 540 women, of which 266 were 
samples taken from postmenopausal women with ovarian cancer and 274 were from age- 
matched healthy controls. In our analysis we set out to compare control and case DNA 
co-methylation networks in search of a differential subnetwork. 

Raw data files were downloaded from GEO (repos, number GSE19711), and were 
obtained from Illumina Infinium 27k Human DNA methylation Beadchip vl.2. The raw 
data was pre-processed by using the lumi package in R [2D] . After quantile normalization, 
PGA applied to the beta value was used to detect and remove extreme outliers. After 
quality control, 243 control samples and 215 case samples remained for further analysis. 
The networks was inferred by taking each probe as a node. Eollowing |26j . an adjacency 
measure was computed as uiij = |(1 -I- C 0 T{gi, gj))/2\^ where cor{gi,gj) denotes the Pear¬ 
son’s correlation coefficient between beta values observed at the and GpG sites. 
The power exponent b was set to a default value of 12 so as to place more emphasis on 
higher positive correlations [49] . Two nodes were linked in the network if was higher 
than 0.2 so that the presence of an edge indicates a strong correlation. This value also 
yields networks that roughly follow a SF model (see Figure [^. The number of resulting 
edges is 48, 224 and 75,913 in the control network A and case network B, respectively. 

Control network Case network 


O 



Log node degree 


Figure 6: Node degree distribution for control and 
networks roughly follow SF network models. 


o 



Log node degree 

co-methylation networks. Both plots shows that the 


At a significance level of 5% and after correction for multiple testing, the dGHD 
algorithm detected a subnetwork of size 1,642, with 1,954 edges in A* and 12,556 edges 
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Figure 7: DNA co-methylation networks: differential subnetworks A* (controls) and B* (cases) detected by 
dGHD algorithm. Six main communities within the subnetworks are characterised by a much higher network 
density in cancer patients compared to healthy controls. Differential methylation is mostly concentrated in C2, 
C3, C5 and Ce (See also Table and Figure . 


in jB* . The two resulting subnetworks are presented in Figure Although the algorithm 
does not constrain the differential networks to be connected, they both comprise a number 
of connected subgraphs. The Walktrap community detection algorithm, as implemented 
in the R package iGraph [36], was used to identify communities in these two subnetworks, 
as shown in the Figure. The density of the six largest communities, which are denoted 
Cl,... ,Cg, differs quite substantially between control and cancer networks. In almost all 
communities, the density is much higher in B*, with the exception of Cg, where it is higher 
in A*. 

To gain initial insight into the biological meaning of the subnetworks and the commu¬ 
nities within them we used the R package GOstat to identify enriched Gene Ontology 
(GO) terms within two broad categories. Biological Processes (BP) and Molecular Func¬ 
tions (MF). At a 5% significance level, the hypergeometric test detected 762 BP and 
154 MF statistically significant terms enriched in the subnetworks where most of these 
terms can be found in 6 communities. For instance, the top three BPs were response to 
stimulus, cellular response to stimulus and response to chemical stimulus, and the top 
three MFs were protein binding, collagen binding and RNA polymerase II transcription 
cofactor activity. Furthermore, we carried out a pathway enrichment analysis to identify 
any signihcantly enriched KEGG pathways. At a 5% significance level, 12 pathways were 
found to be enriched, including hematopoietic cell lineage, acute myeloid leukemia, and 
regulation of action cytoskeleton. 

Probes showing statistically significant changes in mean methylation levels were de¬ 
tected by a two-sample SAM statistic as implemented in the R package samir. After 
Benjamini & Hochberg correction for multiple testing, 2, 770 probes were found to be 
differentially methylated (DM) at the 5% significance level. Of these, 620 were also found 
in the differential subnetworks, 90% of which are concentrated in communities € 2 , 03 ,Cg 
and Cg. For example in community C 3 , there are 109 probes in total, half of which (54) 
are differentially methylated. Figure shows the distribution of DM probes in the sub¬ 
networks. These results suggest that a differential analysis based exclusively on detecting 
mean levels of differential methylation may miss important differences that can only be 
identified by comparing the interaction networks. 

Table 2 provides a breakdown of the number of probes, differentially methylated probes 
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Figure 8 ; Visualization of the distribution of differential methylated probes (red) in differential subnetworks 
detected by dGHD in the DNA co-methylation networks. 


(gi), density ratio between control and case subnetworks (Ri), and distribution of enriched 
GO terms and KEGG pathways in the 6 communities (see also Figure]^. Replicated GO 
terms and pathways involved in different communities were excluded in the subtotal. 

In C 5 we found that all top 6 ranked significant BP terms were related to interleukin- 
3 (IL-3), a cytokine that is made by leukocytes and other cells in the body. IL-3 can 
increase the number of leukocytes, neutrophils, and platelets made by the bone marrow 
m- As Myelosuppression induced by chemotherapy is closely related to the effect of IL-3 
in blood cells when suppressing a tumor during the therapy [16] , this may offer a possible 
explanation for the observed enrichment results. A possible explanation for the observed 
difference in the Cq cluster may be related to hypermethylation being linked to cancer 

1301 unj. 

Table 2: DNA co-methylation networks: a summary for different communities 



Cl 

C 2 

C 3 

C 4 

C 5 

Ce 

subtotal 

overall 

# of probes 

418 

66 

109 

34 

347 

200 

1174 

1642 

Qi 

4 

66 

54 

1 

338 

97 

560 

620 

Ri 

.181 

.013 

.012 

0 

.002 

23.4 

.145 

.156 

BP 

320 

25 

38 

22 

236 

54 

568 

762 

MF 

54 

4 

15 

3 

43 

27 

125 

154 

KEGG 

5 

0 

1 

1 

0 

1 

8 

12 


8 Conclusions 

The comparison of DNA methylation or gene expression profiles across conditions is en¬ 
abling the discovery of novel biomarkers for diagnosis or prognosis, and holds the promise 
to identify novel targets for therapeutical intervention. In this paper we have discussed 
the problem of comparing two labelled networks that are representative of two conditions 
(e.g. healthy and diseased tissues) and detecting statistically significant differences in 
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their topology. Identifying disrupted interaction patterns in two labelled network com¬ 
parisons is a challenging problem requiring novel statistical tools, and three contributions 
have been made here in this direction. Firstly, we have proposed the GHD, a distance 
between two labelled networks that detects more subtle differences compared to the tra¬ 
ditional Hamming distance. Secondly, we have demonstrated that the GHD can be used 
as a non-parametric test to assess whether two paired networks are statistically indepen¬ 
dent, and have described how p-values can be computed in closed-form without requiring 
computationally expensive permutation procedures. The plausibility of the conditions un¬ 
derpinning our derivations has been discussed using scale-free random network models as 
an example. Thirdly, we have proposed a fast subnetwork detection procedure, the dGHD 
algorithm, to detect localized topological differences between two paired networks. This 
methodology provides a useful addition to standard two-sample tests that are commonly 
used for biomarker discovery. An initial evaluation has been carried out by comparing co- 
methylation networks inferred from healthy and cancer patients, and detecting differential 
subnetworks. Further experimental evaluation on independent datasets will be required 
to validate these results. In future work, the methodology could be extended to the case 
of more than two conditions. 
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